Make it easier for user to search for tags by ikuyarihS · Pull Request #542 · python-discord/bot

ikuyarihS · 2019-10-18T03:41:32Z

Closes #231

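Applies the `Needles and Haystack` algorithm to find and match a tag among the existing tags, for example:

![Example](https://cdn.discordapp.com/attachments/634243438459486219/634592981915140107/unknown.png)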

This only applies when the searched tag_name is more than 3 characters long, and a tag counts as a match when at least 80% of the search's letters are found in it, in order from left to right.

There are 3 levels of checking, stopping at the first one that finds a match:

- Check for an exact name match (case insensitive), an O(1) lookup in a dictionary `Dict[str, Tag]`
- Check for all tags that have a 100% match via the algorithm
- Check for all tags that have a >= 80% match
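
The three levels above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `find_tags` and `ordered_letter_score` are hypothetical names, the tags are assumed to live in a plain dict keyed by lowercase name, and the scorer is a simplified take on the "letters found from left to right" idea.

```python
from typing import Callable, Dict, List


def ordered_letter_score(search: str, target: str) -> float:
    """Hypothetical scorer: percent of search letters found in target, left to right."""
    found, index = 0, 0
    for letter in search.lower():
        position = target.find(letter, index)
        if position == -1:
            break
        found += 1
        index = position + 1
    return found / len(search) * 100


def find_tags(
    name: str,
    tags: Dict[str, str],
    score: Callable[[str, str], float] = ordered_letter_score,
) -> List[str]:
    """Return matching tag names, stopping at the first level that yields a hit."""
    # Level 1: exact, case-insensitive lookup (keys are assumed to be lowercase).
    key = name.lower()
    if key in tags:
        return [key]

    # The fuzzy levels only apply to searches longer than 3 characters.
    if len(name) <= 3:
        return []

    # Score every tag once, then try the 100% level before the 80% level.
    scores = {tag: score(name, tag) for tag in tags}
    for threshold in (100, 80):
        hits = [tag for tag, value in scores.items() if value >= threshold]
        if hits:
            return hits
    return []
```

When more than one name comes back, the caller would present them as suggestions rather than picking one.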

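If there is more than one hit, the hits will be shown as suggestions:

![Suggestions](https://cdn.discordapp.com/attachments/634243438459486219/634595369531211778/unknown.png)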

To avoid calling the API multiple times, I've implemented a cache that only refreshes itself when there is a gap of more than 5 minutes since the last API call to get all tags.

Editing / Adding / Deleting tags will also modify the cache directly.
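
A minimal sketch of such a cache, assuming a synchronous `fetch_all` callable standing in for the real (asynchronous) API call. The class name `TagCache`, its methods and the injectable `clock` are hypothetical and only illustrate the refresh-after-5-minutes idea.

```python
import time
from typing import Callable, Dict


class TagCache:
    """Keep a local copy of all tags, refetching only when it is older than max_age."""

    def __init__(
        self,
        fetch_all: Callable[[], Dict[str, str]],
        max_age: float = 300.0,  # 5 minutes, in seconds
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_all = fetch_all
        self._max_age = max_age
        self._clock = clock
        self._tags: Dict[str, str] = {}
        self._last_fetch = 0.0  # "never fetched": any real timestamp is far later

    def get_all(self) -> Dict[str, str]:
        """Return the cached tags, refreshing them first if the cache is stale."""
        now = self._clock()
        if now - self._last_fetch > self._max_age:
            self._tags = dict(self._fetch_all())
            self._last_fetch = now
        return self._tags

    def set(self, name: str, content: str) -> None:
        """Mirror an add or edit in the cache without another fetch."""
        self._tags[name.lower()] = content

    def delete(self, name: str) -> None:
        """Mirror a delete in the cache without another fetch."""
        self._tags.pop(name.lower(), None)
```

The trade-off is that changes made outside the bot (for example through the site) are only picked up on the next refresh.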

What about other solutions like fuzzywuzzy?

fuzzywuzzy was considered, but in testing it gave much lower scores than expected:

Code used to test:

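from typing import Tuple

from fuzzywuzzy import fuzz

def _fuzzy_search(search: str, target: str) -> Tuple[float, int, int]:
    """Return (ordered-letter score, fuzz.ratio, fuzz.partial_ratio) for comparison."""
    found = 0
    index = 0
    _search = search.lower().replace(' ', '')
    _target = target.lower().replace(' ', '')
    for letter in _search:
        # Look for the letter at or after the position of the previous match.
        index = _target.find(letter, index)
        if index == -1:
            break
        # A match at index 0 is not counted, since `index > 0` is False there.
        found += index > 0
    return (
        found / len(_search) * 100,
        fuzz.ratio(search, target),
        fuzz.partial_ratio(search, target)
    )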

tests = (
    'this-is-gonna-be-fun',
    'this-too-will-be-fun'
)

for test in tests:
    print(test, '->', _fuzzy_search('this too fun', test))

Result from test:

this-is-gonna-be-fun -> (30.0, 50, 50)
this-too-will-be-fun -> (90.0, 62, 58)

kosayoda · 2019-10-24T08:31:50Z

Looking at the fuzzy search, have you considered the built-in difflib module?

MarkKoz · 2019-10-24T20:38:03Z

We actually discussed if it'd be better to do the fuzzy search server-side on the API. I haven't looked into it deeply but here are some relevant links:

https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/search/
https://github.com/vsemionov/django-rest-fuzzysearch

I'm not sure if it'd be better to do it server or client side. I think that if there is room for fuzzy search to be used in the future with other endpoints (new or existing), then it should be server side. Another factor would be to see how accurate the pg search features are for our needs here.

ikuyarihS · 2019-10-25T02:55:07Z

> Looking at the fuzzy search, have you considered the built-in difflib module?

I've taken a look at it. It proves quite useful for getting the before/after differences in the on_message_edit event. I've also looked at its SequenceMatcher, which gives results similar to fuzzywuzzy:

...
s = difflib.SequenceMatcher(lambda x: x in ' -', search, target)
return (
    found / len(_search) * 100,
    ('fuzzy', fuzz.ratio(search, target), fuzz.partial_ratio(search, target)),
    ('difflib', tuple(map(lambda x: x * 100, (s.ratio(), s.real_quick_ratio(), s.quick_ratio()))))
)

# --------------------------------
this too fun & this-is-gonna-be-fun -> (30.0, ('fuzzy', 50, 50), ('difflib', (50.0, 75.0, 50.0)))
this too fun & this-too-will-be-fun -> (90.0, ('fuzzy', 62, 58), ('difflib', (62.5, 75.0, 62.5)))
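
For reference, the difflib part of the snippet above can be run on its own with only the standard library. The helper name `difflib_ratio` is hypothetical; the `SequenceMatcher` call mirrors the one shown, with spaces and hyphens treated as junk.

```python
import difflib


def difflib_ratio(search: str, target: str) -> float:
    """Return SequenceMatcher.ratio() as a percentage, ignoring ' ' and '-' as junk."""
    matcher = difflib.SequenceMatcher(lambda char: char in " -", search, target)
    return matcher.ratio() * 100


for tag in ("this-is-gonna-be-fun", "this-too-will-be-fun"):
    print("this too fun &", tag, "->", difflib_ratio("this too fun", tag))
```

This should reproduce the 50.0 and 62.5 `ratio()` values listed in the difflib tuples above.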

I've thought about whether this should be done from the API or from the bot. I think having a cache on the bot will give better performance, especially if we do not modify tags from the site side and restrict tag modification to the bot's commands only; then we can maintain a cache that's perfectly synced with the site.

Postgres search feature looks powerful too, I'll definitely want to see how it performs as well.

SebastiaanZ · 2019-11-05T08:07:35Z

Did we decide on an approach for this? Bot-side or server-side? I kinda like the idea of shipping a query off to the API and having postgres do its thing.

MarkKoz · 2019-11-05T08:40:35Z

Given #388 it's better to keep it client-side as it would eventually have to be client-side anyway. However, that issue is stale so I don't know if we still want to do that. If not, then I agree with doing it on the server-side.

SebastiaanZ · 2019-11-06T11:24:12Z

That's a good point; I'd forgotten about that. Let's ask our tag master, @fiskenslakt, what his current opinion on the matter is to make sure we get this thing moving again.

scragly · 2019-11-15T15:51:54Z

I closed #388. The meta repo now contains markdown files of any of our tags for now for the public to read through and be able to submit PRs for adding or editing tags.

At the moment the process of adding or editing the tags is done via bot command in-server or by using the site's admin page (mods+ now have full access to the tag admin page).

There are improvements that can be made to make things easier and to automate/integrate the process, but I'm of the opinion that tags will continue to live in the database, be accessible via the API and editable via web admin, and as such we should probably stick to doing fuzzy matching API-side.

MarkKoz

Since this is already done and there has been no progress re-implementing this on the site, I think it is best to get this PR merged for the time being.

- Changed type of `self._last_fetch` to `float` and gave it the initial value of `0.0` instead of `None`
- Assigned `time.time()` to `time_now` to avoid calling this function twice.
- Added `self._last_fetch = time_now` after making the API call.

…ciency.

- Matching scores will be calculated once now and stored in the dict `scores`.
- Allow `_get_suggestions()` to go through a list of score thresholds and return the first non-empty list of matching tags above the threshold. This avoids calling the function multiple times like before (`self._get_suggestions(tag_name, 100) or self._get_suggestions(tag_name, 80)`, for example, calls this function twice and is inefficient)
- Deleted commented line.
- Added `typing` module for more typehints.

Addressed

MarkKoz

For a tag named foo-bar, foobars will not match and neither will foo_bar. foobar does match. The tags command doesn't seem to like spaces in tag names - it will never match.

- Added a regex to remove non-alphabet ( `[^a-z]` with `re.IGNORECASE` )

… 60]

- Since it returns as soon as suggestions are found for a threshold, this will give a better reflection of what the bot thinks the user is searching for.

ikuyarihS · 2020-02-04T21:03:11Z

Interesting! I've added a regex to remove all non-alphabet characters, and increased the thresholds from [100, 80] to [100, 90, 80, 70, 60]. Since the search stops as soon as suggestions are found for a threshold, this gives suggestions closer to what the bot thinks the user is searching for. This fixed both foo_bar and foobars when searching for foo-bar.
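
As a small illustration of the normalisation step described above (a sketch only; `normalise` is a hypothetical helper, and the pattern is the `[^a-z]` with `re.IGNORECASE` mentioned in the commit message):

```python
import re

# Matches every character that is not an ASCII letter, in either case.
NON_ALPHABET = re.compile(r"[^a-z]", re.IGNORECASE)


def normalise(name: str) -> str:
    """Strip non-letters and lowercase, so 'foo_bar' and 'foo-bar' compare equal."""
    return NON_ALPHABET.sub("", name).lower()


print(normalise("foo_bar"), normalise("foo-bar"), normalise("Foo Bar"))
```

One side effect worth keeping in mind is that digits are stripped as well, so a name like `pep8` normalises to `pep`.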

MarkKoz · 2020-02-04T21:45:26Z

In some cases that still isn't working so well:

- asks returns args-kwargs instead of ask
- foos returns off-topic and functions-are-objects instead of foo
- dict returns iterate-dict without considering dictcomps too
- opens returns both scope and open when the latter is obviously a much better match

Also discovered an unrelated issue in which it can't handle DELETE or GET requests for tags with spaces in them (returns 404). Might be a URL encoding issue since the tag is part of the URL path. It can POST fine because the tag name is instead part of the JSON.
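
If the cause is indeed URL encoding (that is only a guess at this point), percent-encoding the tag name before putting it in the path would be one way to handle it. A minimal sketch using the standard library; `tag_url` and the `bot/tags` base path are hypothetical and not taken from the actual API client:

```python
from urllib.parse import quote


def tag_url(base: str, tag_name: str) -> str:
    """Build a URL path for a tag, percent-encoding the name as one path segment."""
    # safe="" also encodes "/", so the name can never split into extra segments.
    return f"{base.rstrip('/')}/{quote(tag_name, safe='')}"


print(tag_url("bot/tags", "my tag"))  # bot/tags/my%20tag
```

POST is unaffected because the name travels in the JSON body rather than in the path.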

ikuyarihS · 2020-02-05T04:48:19Z

Hmm, I've added another layer of complexity that forces the search to go word by word. Here's the snippet I used to test:

import re
from typing import Dict, List, Optional

# Flags must be combined with `|`; `&` would evaluate to 0 and apply no flags at all.
REGEX_NON_ALPHABET = re.compile(r"[^a-z]", re.MULTILINE | re.IGNORECASE)

stuff = ['args-kwargs', 'ask', 'class', 'classmethod', 'codeblock', 'decorators', 'dictcomps', 'enumerate', 'except', 'exit()', 'f-strings', 'foo', 'functions-are-objects', 'global', 'if-name-main', 'indent', 'inline', 'iterate-dict', 'listcomps', 'mutable-default-args', 'names', 'no-dm',
         'off-topic', 'open', 'or-gotcha', 'param-arg', 'paste', 'pathlib', 'pep8', 'positional-keyword', 'precedence', 'quotes', 'relative-path', 'repl', 'return', 'round', 'scope', 'seek', 'self', 'star-imports', 'traceback', 'windows-path', 'with', 'xy-problem', 'ytdl', 'zen', 'zip', ]

_cache = dict(zip(stuff, stuff))


def _fuzzy_search(search: str, target: str) -> float:
    """A simple scoring algorithm based on how many letters are found / total, with order in mind."""
    current, index = 0, 0
    _search = REGEX_NON_ALPHABET.sub('', search.lower())
    _targets = iter(REGEX_NON_ALPHABET.split(target.lower()))
    _target = next(_targets)
    try:
        while True:
            while index < len(_target) and _search[current] == _target[index]:
                current += 1
                index += 1
            index, _target = 0, next(_targets)
    except (StopIteration, IndexError):
        pass
    return current / len(_search) * 100


def _get_suggestions(tag_name: str, thresholds: Optional[List[int]] = None) -> str:
    """Return a printable summary of the suggested tags for this test."""
    scores: Dict[str, float] = {
        tag_title: _fuzzy_search(tag_name, tag)
        for tag_title, tag in _cache.items()
    }

    thresholds = thresholds or [100, 90, 80, 70, 60]

    for threshold in thresholds:
        suggestions = [
            _cache[tag_title]
            for tag_title, matching_score in scores.items()
            if matching_score >= threshold
        ]
        if suggestions:
            return f"{repr(tag_name)} - {suggestions}"

    return f"{repr(tag_name)} not found"


print(_get_suggestions('fstring'))
print(_get_suggestions('fstrings'))
print(_get_suggestions('fstr'))
print(_get_suggestions('f-str'))
print(_get_suggestions('f-string'))
print(_get_suggestions('f-strings'))
print(_get_suggestions('asks'))
print(_get_suggestions('foos'))
print(_get_suggestions('dict'))
print(_get_suggestions('opens'))
print(_get_suggestions('or'))
print(_get_suggestions('or-g'))
print(_get_suggestions('or-'))
print(_get_suggestions('got'))
print(_get_suggestions('path'))
print(_get_suggestions('main'))
print(_get_suggestions('if'))
print(_get_suggestions('if main'))
print(_get_suggestions('asdfasdf'))

Here are the results:

'fstring' - ['f-strings']
'fstrings' - ['f-strings']
'fstr' - ['f-strings']
'f-str' - ['f-strings']
'f-string' - ['f-strings']
'f-strings' - ['f-strings']
'asks' - ['ask']
'foos' - ['foo']
'dict' - ['dictcomps', 'iterate-dict']
'opens' - ['open']
'or' - ['or-gotcha']
'or-g' - ['or-gotcha']
'or-' - ['or-gotcha']
'got' - ['or-gotcha']
'path' - ['pathlib', 'relative-path', 'windows-path']
'main' - ['if-name-main']
'if' - ['if-name-main']
'if main' - ['if-name-main']
'asdfasdf' not found

- Added regex back to sub and split by non-alphabet.
- Now use two pointers to move from word to word.

MarkKoz

That's working much better.

Akarys42

Works great!

ikuyarihS added t: feature New feature or request area: cogs p: 3 - low Low Priority labels Oct 18, 2019

ikuyarihS requested review from SebastiaanZ and sco1 October 18, 2019 03:41

ikuyarihS self-assigned this Oct 18, 2019

SebastiaanZ unassigned ikuyarihS Oct 25, 2019

scragly removed the area: cogs label Nov 14, 2019

scragly added a: API Related to or causes API changes a: information Related to information commands: (doc, help, information, reddit, site, tags) s: stalled Something is blocking further progress type: Enhancement and removed t: feature New feature or request labels Nov 15, 2019

lemonsaurus added t: feature New feature or request and removed type: enhancement labels Dec 15, 2019

Merge branch 'master' into fuzzy-tag-search

b71acff

jb3 requested a review from a team as a code owner February 2, 2020 22:52

MarkKoz previously requested changes Feb 3, 2020

Comment thread bot/cogs/tags.py Outdated

MarkKoz added s: waiting for author Waiting for author to address a review or respond to a comment and removed s: stalled Something is blocking further progress a: API Related to or causes API changes labels Feb 3, 2020

ikuyarihS added status: needs review and removed s: waiting for author Waiting for author to address a review or respond to a comment labels Feb 4, 2020

MarkKoz reviewed Feb 4, 2020

ikuyarihS added 2 commits February 5, 2020 04:00

Removed non-alphabets from both search and tag_name when scoring.

a38926f

- Added a regex to remove non-alphabet ( `[^a-z]` with `re.IGNORECASE` )

Increased default thresholds from just [100, 80] to [100, 90, 80, 70,…

a6341b1

… 60] - Since it is returning as soon as there are suggestions found for a threshold, this will give a better reflection of what the bot thinks user is searching for.

Removed regex, implemented a stricter letter searching.

c054790

Made searching even stricter by searching from start of each word

8dd66bc

- Added regex back to sub and split by non-alphabet.
- Now use two pointers to move from word to word.

MarkKoz approved these changes Feb 6, 2020

Akarys42 approved these changes Feb 7, 2020

Merge branch 'master' into fuzzy-tag-search

4a685ed

ikuyarihS merged commit e205bf6 into master Feb 7, 2020

ikuyarihS deleted the fuzzy-tag-search branch February 7, 2020 08:27

8 participants
